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Abstract 

A key issue in statistics and machine learning is to automatically select the 
"right" model complexity, e.g., the number of neighbors to be averaged over 
in k nearest neighbor (kNN) regression or the polynomial degree in regression 
with polynomials. We suggest a novel principle - the Loss Rank Principle 
(LoRP) - for model selection in regression and classification. It is based on 
the loss rank, which counts how many other (fictitious) data would be fitted 
better. LoRP selects the model that has minimal loss rank. Unlike most 
penalized maximum likelihood variants (AIC, BIC, MDL), LoRP depends only 
on the regression functions and the loss function. It works without a stochastic 
noise model, and is directly applicable to any non-parametric regressor, like 
kNN. 
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1 Introduction 



Regression. Consider a regression or classification problem in which we want to 
determine the functional relationship yi~/tme(^) from data D={(xi,yi),...,(x n , y n )}, 
i.e., we seek a function r(.\D) = r(D)(.) such that r(x\D) = r(D)(x) is close to the 
unknown /t ru e(^) for all x. One may define r(.\D) directly, e.g., "average the y 
values of the k nearest neighbors (kNN) of x in D" , or select r(.\D) from a class of 
functions T that has smallest (training) error onD. If the class J 7 is not too large, 
e.g., the polynomials of fixed reasonable degree d, this often works well. 

Model selection. What remains is to select the right model complexity c, like k 
or d. This selection cannot be based on the training error, since the more complex 
the model (large d, small k) the better the fit on D (perfect for d — n and k — 1). 
This problem is called overfitting, for which various remedies have been suggested. 

The most popular ones in practice are based on a test set D' used for selecting 
the c for which the function r c (.\D) has smallest (test) error on D', or improved 
versions like cross-validation [A1174j . Typically D' is cut from D, thus reducing the 
sample size available for regression. Test set methods often work well in practice, 
but the reduced sample decreases accuracy, which can be a serious problem if n is 
small. We will not discuss empirical test set methods any further. See |Mac92] for 
a comparison of cross-validation with Bayesian model selection. 

There are also various model selection methods that allow to use all data D 
for regression. The most popular ones can be regarded as penalized versions 
of Maximum Likelihood (ML). In addition to the function class J- c (subscript c 
belonging to some set indexing the complexity), one has to specify a sampling 
model P(D\f), e.g., that the yi have independent Gaussian distribution with mean 
f(xi). ML chooses r c (D) = argmaxj g jr c P(D|/), Penalized ML (PML) then chooses 
c = argmin c {— logP(-D|r c (D))+Penalty(c)}, where the penalty depends on the used 
approach (MDL [Ris78| . BIC [Sch78j . AIC [Aka73j ). All PML variants rely on a 
proper sampling model (which may be difficult to establish), ignore (or at least do not 
tell how to incorporate) a potentially given loss function (see |Yam99t IGru04j for ex- 
ceptions), are based on distribution- independent penalties (which may result in bad 
performance for specific distributions), and are typically limited to (semi)parametric 
models. 

Main idea. The main goal of the paper is to establish a criterion for selecting 
the "best" model complexity c based on regressors r c given as a black box without 
insight into the origin or inner structure of r c , that does not depend on things often 
not given (like a stochastic noise model), and that exploits what is/should be given 
(like the loss function, note that the criterion can also be used for loss-function 
selection, see Section [S]). The key observation we exploit is that large classes J- c or 
more flexible regressors r c can fit more data well than more rigid ones. We define 
the loss rank of r c as the number of other (fictitious) data D 1 that are fitted better 
by r c (D') than D is fitted by r c (D), as measured by some loss function. The loss 
rank is large for regressors fitting D not well and for too flexible regressors (in both 
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cases the regressor fits many other D' better). The loss rank has a minimum for 
not too flexible regressors which fit D not too bad. We claim that minimizing the 
loss rank is a suitable model selection criterion, since it trades off the quality of 
fit with the flexibility of the model. Unlike PML, our Loss Rank Principle (LoRP) 
works without a noise (stochastic sampling) model, and is directly applicable to any 
non-parametric regressor, like kNN. 

Related ideas. There are various other ideas that somehow count fictitious data. In 
normalized ML |Grii04] . the complexity of a stochastic model class is defined as the 
log sum over all D' of maximum likelihood probabilities. In the luckiness framework 
for classification |Her02} Chp.4], the loss rank is related to the level of a hypothesis, 
if the empirical loss is used as an unluckiness function. The empirical Rademacher 
complexity |Kol01t IBBL02] averages over all possible relabeled instances. Finally, 
instead of considering all D' one could consider only the set of all permutations of 
{yi,...,y n }, like in permutation tests |ET93j . The test statistic would here be the 
empirical loss. 

Contents. In Section |2l after giving a brief introduction to regression, we formally 
state LoRP for model selection. Explicit expressions for the loss rank for the im- 
portant class of linear regressors are derived in Section [3j this class includes kNN, 
polynomial, linear basis function (LBFR), kernel, projective regression, and some 
others. In Section HJ we establish optimality properties of LoRP for linear regres- 
sion, namely model consistency and asymptotic mean efficiency. Experiments are 
presented in Section |5j We compare LoRP to other selection methods and demon- 
strate the use of LoRP for some specific problems like choosing tuning parameters 
in kNN and spline regression. In Section |6] we compare linear LoRP to Bayesian 
model selection for linear regression with Gaussian noise and prior, and in Section 
[7] to PML, in particular MDL, BIC, and AIC, and then discuss two trace formulas 
for the effective dimension. Sections IBlfTUl can be considered as extension sections. 
In Section [8] we show how to generalize linear LoRP to non-quadratic loss, in par- 
ticular to other norms. We also discuss how LoRP can be used to select the loss 
function itself, in case it is not part of the problem specification. In Section |9] we 
briefly discuss interpolation. LoRP only depends on the regressor on data D and 
not on x £ {x\,...,x n }. We construct canonical regressors for off-data interpolation 
from regressors given only on-data, in particular for kNN, Kernel, and LBFR, and 
show that they are canonical. In Section [10] we derive exact expressions for kNN 
when {x\,...,x n } forms a discrete <i-dimensional hypercube, and discuss the limits 
n— too, k— >oo, and d— >oo. Section HT1 contains the conclusions of our work and 
further considerations that could be elaborated on in the future. 

The main idea of LoRP has already been presented at the COLT 2007 conference 
[Hut07j . In this paper we present LoRP more thoroughly, discover its theoretical 
properties and evaluate the method through some experiments. 
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2 The Loss Rank Principle 



After giving a brief introduction to regression, classification, model selection, over- 
fitting, and some reoccurring examples, we state our novel Loss Rank Principle for 
model selection. We first state it for classification (Principle [3] for discrete values), 
and then generalize it for regression (Principle for continuous values), and exem- 
plify it on two (over-simplistic) artificial Examples H] and Thereafter we show how 
to regularize LoRP for realistic regression problems. 

Setup and notation. We assume data D — (x,y) := {(xi,yi),...,(x n ,y n )} G (X x 
y) n =:T> has been observed. We think of the y as having an approximate functional 
dependence on x, i.e., yi ~ ftrae( x i), where means that the yi are distorted by 
noise from the unknown "true" values /true^i)- We will write (x,y) for generic data 
points, use vector notation x = (xi,...,x n ) and y = (y lr . .,y n ) T , and D' — (x',y') for 
generic (fictitious) data of size n. A full list of abbreviations and notations used 
throughout the paper is placed in the appendix. 

Regression and classification. In regression problems y is typically (a subset 
of) the real set M or some more general measurable space like M m . In classification, 
y is a finite set or at least discrete. We impose no restrictions on X. Indeed, x 
will essentially be fixed and plays only a spectator role, so we will often notationally 
suppress dependencies on x. The goal of regression/classification is to find a function 
fo^FcX^-y "close" to /true based on the past observations D. Or phrased in 
another way: we are interested in a mapping r : T> — > J 7 such that y := r(x\D) = 
r(D)(x)=f D (x)^f tIue (x) for all xeX. 

Example 1 (polynomial regression) For X = y = M, consider the set : = 
{fw(x) = WdX^ 1 + ... + W2X + W1 : w G M d } of polynomials of degree d— 1. Fitting 
the polynomial to data D, e.g., by least squares regression, we estimate w with wo- 
The regression function y = rd(x\D) = fw D {x) can be written down in closed form 
(see Example [9]) . {> 

Example 2 (k nearest neighbors) Let y be some vector space like M and X be 
a metric space like M m with some (e.g., Euclidean) metric d(-,-). kNN estimates 
/true(^) by averaging the y values of the k nearest neighbors Nk(x) of x in D, i.e., 
rk(x\D) — !X/ieVfc(a:)^ w ith \Afk(x)\ = k such that d(x,Xi) < d(x,Xj) for all i&Afk(x) 
and j£Af k {x). 

Parametric versus non-parametric regression. Polynomial regression is an 
example of parametric regression in the sense that rd(D) is the optimal function 
from a family of functions Td indexed by d<oo real parameters (w). In contrast, 
the kNN regressor is directly given and is not based on a finite-dimensional 
family of functions. In general, r may be given either directly or be the result of an 
optimization process. 

Loss function. The quality of fit to the data is usually measured by a loss func- 
tion Loss(y,y), where y t = /o(xi) is an estimate of y. t . Often the loss is additive: 
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Loss(y,y) = Xl™ =1 Loss(?/j,^). If the class T is not too large, good regression func- 
tions r(D) can be found by minimizing the loss w.r.t. all / G J 7 . For instance, 
r d (D) =argmm f £j r d Y^ =1 (yi-f(xi)) 2 and y = r d (x\D) in Example HJ 

Regression class and loss. In the following we assume a class of regressors 1Z 
(whatever their origin), e.g., the kNN regressors {r^ : k G IN} or the least squares 
polynomial regressors {r d : d G Wo := IVU{0}}. Each regressor r can be thought 
of as a model. Throughout the paper, we use the terms "regressor" and "model" 
interchangeably. Note that unlike / G J 7 , regressors r G 1Z are not functions of x 
alone but depend on all observations D, in particular on y. Like for functions /, we 
can compute the empirical loss of each regressor r&TZ: 

n 

Loss r (D) = Loss r (y|a;) := Loss(y, y) = y~Xoss(yj, r(xj\x, y)) 

i=i 

where yi = r(xi\D) in the third expression, and the last expression holds in case of 
additive loss. 

Overfitting. Unfortunately, minimizing Loss r w.r.t. r will typically not select the 
"best" overall regressor. This is the well-known overfitting problem. In case of 
polynomials, the classes JrfCJd+i are nested, hence Loss rd is monotone decreasing 
in d with Loss rn = perfectly fitting the data. In case of kNN, Loss rfc is more 
or less an increasing function in k with perfect regression on D for k — 1, since no 
averaging takes place. In general, TZ is often indexed by a "flexibility" or smoothness 
or complexity parameter, which has to be properly determined. The more flexible r 
is, the closer it can fit the data. Hence such r has smaller empirical loss, but is not 
necessarily better since it has higher variance. Clearly, too inflexible r also lead to 
a bad fit ("high bias"). 

Main goal. The main goal of the paper is to establish a selection criterion in order 
to specify the smallest model to which / true belongs or is close to, and simultaneously 
determine the "best" fitting function r(D). The criterion 

• is based on r given as a black box that does not require insight into the origin 
or inner structure of r; 

• does not depend on things often not given (like a stochastic noise model); and 

• exploits what is or should be given (like the loss function). 

Definition of loss rank. We first consider discrete y (i.e., classification), fix x, 
y is the observed data and y' are fictitious others. The key observation we exploit 
is that a more flexible r can fit more data D' ET> well than a more rigid one. The 
more flexible r is, the smaller the empirical loss Loss r (t/|cc) is. Instead of minimizing 
the unsuitable Loss r (y\x) w.r.t. r, we could ask how many y' £y n lead to smaller 
Loss r than y. We define the loss rank of r (w.r.t. y) as the number of y' &y n with 
smaller or equal empirical loss than y: 

Rank r (y|cc) = Rank r (L) := #{y' &y n ■ Loss r (^'|a;) <L} with L := Loss r (y \x) (1) 
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We claim that the loss rank of r is a suitable model selection measure. For (pQ) to 
make sense, we have to assume (and will later assure) that Rank r (L) <oo, i.e., there 
are only finitely many y' G^ n having loss smaller than L. 

Since the logarithm is a strictly monotone increasing function, we can also con- 
sider the logarithmic rank LK r (y\x) := logRank r (y|a;), which will be more conve- 
nient. 

Principle 3 (LoRP for classification) For discrete y , the best classi- 
fier /regressor r : T> x X — >■ y in some class TZ for data D = (x,y) is the one 
with the smallest loss rank: 

r best = argminLR r (j/|a;) = argmin ~R&nk r .(y\x) (2) 

where Rank r is defined in (pQ). 

We give a simple example for which we can compute all ranks by hand to help 
the reader better grasp how the principle works. 

Example 4 (simple discrete) Consider X = {1,2}, y = {0,1,2}, and two points 
D={(1,1),(2,2)} lying on the diagonal x=y, with polynomial (zero, constant, linear) 
least squares regressors TZ— {r ,ri,r 2 } (see ExJT]). r is simply 0, r\ the ^/-average, 
and r 2 the line through points (l,yi) and (2,y 2 ). This, together with the quadratic 
Loss for generic y' and observed y = (l,2) and fixed a; = (1,2), is summarized in the 
following table 
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From the Loss we can easily compute the Rank for all nine ^'e{0,l,2} 2 . Equal rank 
due to equal loss is indicated by a "=" in the table below. Whole equality groups 
are actually assigned the rank of their right-most member, e.g., for d=l the ranks 
of (y^) = (0,l),(l,0),(2,l),(l,2) are all 7 (and not 4,5,6,7). 
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9 



So LoRP selects T\ as best regressor, since it has minimal rank onD. r fits D too 
badly and r 2 is too flexible (perfectly fits all D'). <0 

LoRP for continuous y. We now consider the case of continuous or measurable 
spaces y, i.e., normal regression problems. We assume y = M in the following 
exposition, but the idea and resulting principle hold for more general measurable 
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spaces like IFF 1 . We simply reduce the model selection problem to the discrete case by 
considering the discretized space y s = eZ for small e>0 and discretize y^y e ^sZ n 
('W means "is replaced by"). Then Rank^(L) :=#{y' E ey2 :Loss r (y' £ \x)<L} with 
L = Loss r (y £ \x) counting the number of e-grid points in the set 

V r (L) := {y' E y n : Loss r {y'\x) < L} (3) 

which we assume (and later assure) to be finite, analogous to the discrete case. 
Hence Rank^(L)-e ra is an approximation of the loss volume \V r (L)\ of set V r (L), 
and typically Rank^(L)-e n = \V r (L)\ ■ (l + 0(e)) -» \V r (L)\ for e 0. Taking the 
logarithm we get LRl(y\x) = \ogRaiik £ r (L) = \og\V r (L)\—nloge+0(e). Since nloge is 
independent of r, we can drop it in comparisons like (J2j). So for e— >0 we can define 
the log-loss "rank" simply as the log- volume 

LR r (y\x) := log|V^.(L)|, where L := Loss r (y\x) (4) 

Principle 5 (LoRP for regression) For measurable y, the best regressor r:T>x 
X ^y in some class 71 for data D = (x,y) is the one with the smallest loss volume: 

r best = argmin LR r (y\x) = argmin |K(L)| 

r£TZ ' relZ 

where LR, V r , and L are defined in (J3J) and (jlj), and \V r (L)\ is the volume ofV r (L)C 

y n . 

For discrete y with counting measure we recover the discrete LoRP (Principle 

ED. 

Example 6 (simple continuous) Consider Example H] but with interval y= [0,2]. 
The first table remains unchanged, while the second table becomes 
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4 



So LoRP again selects ri as best regressor, since it has smallest loss volume on D. 





Infinite rank or volume. Often the loss rank/volume will be infinite, e.g., if 
we had chosen y = in ExJH or y = M in Exj6j There are various potential 
remedies. We could modify (a) the regressor r or (b) the Loss to make LR r finite, 
(c) the Loss Rank Principle itself, or (d) find problem-specific solutions. Regressors 
r with infinite rank might be rejected for philosophical or pragmatic reasons. We 
will briefly consider (a) for linear regression later, but to fiddle around with r in a 
generic (blackbox way) seems difficult. We have no good idea how to tinker with 
LoRP (c), and also a patched LoRP may be less attractive. For kNN on a grid we 
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later use remedy (d). While in (decision) theory, the application's goal determines 
the loss, in practice the loss is often more determined by convenience or rules of 
thumb. So the Loss (b) seems the most inviting place to tinker with. A very simple 
modification is to add a small penalty term to the loss. 

Loss r (y|:r) Loss"(y \x) := Loss r (y\x) + a\\y || 2 , a > "small" (5) 

The Euclidean norm \\y\\ 2 '^Yl^iVi * s default, but other (non)norm regularizations 
are possible. The regularized hK"(y\x) based on Loss" is always finite, since {y : 
\\y\\ 2 <L} has finite volume. An alternative penalty ay^y, quadratic in the regression 
estimates yi = r(xi\x,y) is possible if r is unbounded in every y^oo direction. 

A scheme trying to determine a single (flexibility) parameter (like d and k in 
the above examples) would be of no use if it depended on one (or more) other 
unknown parameters (a), since varying through the unknown parameter leads to 
any (non) desired result. Since LoRP seeks the r of smallest rank, it is natural to 
also determine a = a m \ n by minimizing LR" w.r.t. a. The good news is that this 
leads to meaningful results. Interestingly, as we will see later, a clever choice of a 
may also result in alternative optimalities of the selection procedure. 



3 LoRP for y-Linear Models 

In this section we consider the important class of y-linear regressions with quadratic 
loss function. By "y-linear regression" , we mean the linearity is only assumed in y 
and the dependence on x can be arbitrary. This class is richer than it may appear. It 
includes the normal linear regression model, kNN (Example Ej), kernel (Example El), 
and many other regression models. For y-linear regression and y = M, the loss rank 
is the volume of an n-dimensional ellipsoid, which can efficiently be computed in 
time 0(n 3 ) (Theorem [TO]) . For the special case of projective regression, e.g., linear 
basis function regression (Example [9]), we can even determine the regularization 
parameter a analytically (Theorem [TTT) . 

y-Linear regression. We assume y = M in this section; generalization to M m is 
straightforward. A y-linear regressor r can be written in the form 

n 

y = r(x\x,y) = '^^mj(x,x)yj Mx E X and some rrij : X x X n — > ]R (6) 

Particularly interesting is r for 

Vi = r( Xi \x,y) = Y, M iA x )Vi with M : X n lR nxn (7) 

3 

where matrix M^(x) =rrij(xi,x). Since LoRP needs r only on the training data x, 
we only need M. 

Example 7 (kNN ctd.) For kNN of Exfjwe have m j (x,x) = \ if jeAf k (x) and 
else, and M^-(cc) = | if j&J\fk(%i) an d else. ^> 
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Example 8 (kernel regression) Kernel regression takes a weighted average over 
y, where the weight of yj to y is proportional to the similarity of Xj to x, measured by 
a kernel K(x )=K(x,Xj)/Y^j = iK(x,Xj). For example the Gaussian 

kernel for X = M m is K(x,Xj) =e~^ x ~ Xj ^^ 2,j2 . The width a controls the smoothness 
of the kernel regressor, and LoRP selects the real-valued "complexity" parameter a. 



Example 9 (linear basis function regression, LBFR) Let be a 

set or vector of "basis" functions often called "features". We place no restrictions 
on X or tf> : X — > M d . Consider the class of functions linear in </>: 

For instance, for X = M and <j) a {x) =x a ~ l we would recover the polynomial regression 
Example [U For quadratic loss function Loss(?/j,t/i) = {y-i — y-i) 2 we have 



n 

Loss w (y\<f>) := - f w (xi)) 2 = y T y - 2y T $w + w T Bw 

i=l 



where matrix $ is defined by $j a = 4> a {xi) and B is a symmetric matrix with 
B a b — Yl^i^aix^^bixi) = [$ T $] a 6. The loss is quadratic in w with minimum at 
w = fi _1 $ T j/. So the least squares regressor is y = y T &B~ 1 <fi(x), hence rrij(x,x) — 
and M(x) = $B- 1 $ T . 
Consider now a general linear regressor M with quadratic loss and quadratic 
penalty n 2 

\jo$s a M {y\x) = ^2 \Vi - Y^=i M ijVj) + a \\y\\ 2 = y T s a y, 



i=l 



whera3 S a = (I — M ) T (J — M) + al (8) 

(J is the identity matrix). S a is a symmetric matrix. For a>0 it is positive definite 
and for a = positive semidefinite. If Ai,...,A n >0 are the eigenvalues of So, then 
\i+a are the eigenvalues of S a . V(L) = {y' ElR n :y' T S a y' <L} is an ellipsoid with the 
eigenvectors of S a being the main axes and ^/L/(Xi+a) being their length. Hence 

the volume is 

n I T 7, T n l 2 

w{L)\ = v „n ' 



i=l 



Xi + a y/ det S a 



where v n = 7i n ^ 2 /^! is the volume of the n-dimensional unit sphere, z\:—T(z+l), and 
det is the determinant. Taking the logarithm we get 

LR a M (y\x) = log|y(Ix)ssXf(i/|aj))| = f log(y T S a y) - \ log det S a + \ogv n (9) 

Since v n is independent of a and M it is possible to drop v n . Consider now a 
class of linear regressors Ai = {M}, e.g., the kNN regressors {M/~ : k G IV} or the 
(i-dimensional linear basis function regressors {M^:d^lN }. 



lr The mentioned alternative penalty a||y|| 2 would lead to S a = (I-M) T (I-M) + aM T M. For 
LBFR, penalty a||t&|| 2 is popular (ridge regression). Apart from being limited to parametric 
regression, it has the disadvantage of not being reparametrization invariant. For instance, scaling 
<t>a{x)^la4>a,{x) does not change the class Td, but changes the ridge regressor. 
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Theorem 10 (LoRP for y-linear regression) For y = M, the best linear regres- 
sor M : X n — >M nxn in some class M. for data D = (x,y) is 

To 

M best = argmin{ftog(y T S a |/)-flogdetS' a } = argmin { y aV } (10) 

where S a = S a (M) is defined in (JBJ). 

The last expression shows that linear LoRP minimizes the Loss times the geomet- 
ric average of the squared axes lengths of ellipsoid V(l). Note that M best depends 
on y unlike the MeM.. 

Nullspace of So- If M has an eigenvalue 1, then So = (I — M) T (I — M) has a 
zero eigenvalue and a > is necessary, since detSo = 0. Actually this is true for 
most practical M. Most linear regressors are invariant under a constant shift of 
y, i.e., r(x\x,y + c) = r(x\x,y) + c, which implies that M has eigenvector (1,...,1) 
with eigenvalue 1. This can easily be checked for kNN (Exj2j), kernel (ExJHJ), and 
LBFR (Exj9]). Such a generic 1-eigenvector effecting all M E M could easily and 
maybe should be filtered out by considering only the orthogonal space or dropping 
these Aj = when computing detSo. The 1-eigenvectors that depend on M are the 
ones where we really need a regularizer a > 0. For instance, in LBFR has d 
eigenvalues 1, and MkNN has as many eigenvalues 1 as there are disjoint components 
in the graph determined by the edges My >0. In general we need to find the optimal 
a numerically. If M is a projection we can find a m analytically. 

Numerical approximation of (detS a ) 1 ^ n and the computational complex- 
ity of linear LoRP. For each a and candidate model, the determinant of S a in the 
general case can be computed in time 0(n 3 ). Often M is a very sparse matrix (like 
in kNN) or can be well approximated by a sparse matrix (like for kernel regression), 
which allows us to approximate detS'a sometimes in linear time |Reu02j . To search 
the optimal a and M, the computational cost depends on the range of a we search 
and the number of candidate models we have. 

Projective regression. Consider a projection matrix M = P = P 2 with d(=trP) 
eigenvalues 1, and n—d zero eigenvalues. For instance, M = $i? _1 $ T of LBFR ExJHJ 
is such a matrix. This implies that S a has d eigenvalues a and n — d eigenvalues 
1+a, thus detS a = a d (l+a) n ~ d . Let p— \\y— y || 2 /||y|| 2 , then y T S a y=(p+a)y T y and 

LRp = f log y T y + f log(p + a) - f log a - ^ log(l + a). (11) 

Solving dLRp/da = w.r.t. a we get a minimum at a = a m := ^_^ n _ d provided that 
1 — p>d/n. After some algebra we get 

LR^ = flog^-fKL(^||l-p), where KL(p\\q) :=plog2 + (1 -p)log±E§ (12) 

is the relative entropy or the Kullback-Leibler divergence. Note that ( JT2l) is still 
valid without the condition l—p>d/n (the term log((l— p)n— d) has been canceled 
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in the derivation). What we need when using (fT2|) is that d<n and p< 1, which 
are very reasonable in practice. Interestingly, if we use the penalty a||y|| 2 instead of 
a||y|| 2 , the loss rank then has the same expression as (|T2l) without any conditional. 

Minimizing LRp m w.r.t. P is equivalent to maximizing KL(^||1 — p). The term p 
is a measure of fit. If increases, then p decreases and otherwise. We are seeking 
a tradeoff between the model complexity d and the measure of fit p, and LoRP 
suggests the optimal tradeoff by maximizing KL. 

Theorem 11 (LoRP for projective regression) The best projective regressor 
P:X n ^-M nxn with P = P 2 in some projective class V for data D = (x,y) is 

P best = argmax KL(*^||»^»). (13) 

to p g -p V n II yy > V J 

4 Optimality Properties of LoRP for Variable Se- 
lection 

In the previous sections, LoRP was stated for general-purpose model selection. By 
restricting attention to linear regression models, we will point out in this section 
some theoretical properties of LoRP for variable (also called feature or attribute) 
selection. 

Variable selection is probably the most fundamental and important topic in lin- 
ear regression analysis. At the initial stage of modeling, a large number of potential 
covariates are often introduced; one then has to select a smaller subset of the co- 
variates to fit/interpret the data. There are two main goals of variable selection, 
one is model identification, the other is regression estimation. The former aims at 
identifying the true subset generating the data, while the latter aims at estimating 
efficiently the regression function, i.e., selecting a subset that has the minimum mean 
squared error loss. Note that whether or not there is a selection criterion achieving 
simultaneously these two goals is still an open question |Yan05t IGrii04j . We show 
that with the optimal parameter a (defined as a m that minimizes the loss rank LR^ f 
in a), LoRP satisfies the first goal, while with a suitable choice of a, LoRP satisfies 
the second goal. 

Given d+1 potential covariates Xo = l,Xi,...,Xd and a response variable Y, let 
X = x be a non-random design matrix of size fix (d+1) and y be a response vector 
respectively (if y and X are centered, then the covariate 1 can be omitted from 
the models). Denote by S = {0,ji,...j\s\-i} the candidate model that has covariates 
X ,Xj 1 ,...,Xj lsl _ 1 . Under a proposed model S, we can write 

y = X S (3 S + a s e 

2 Then S a — (I n — P) T (I n — P) + aP T P = I n + (a — 1)P has d eigenvalues a and n — d eigenvalues 
1, thus det(S a ) = a d . The loss rank LRp = |logy T y + |log(l + (a — 1)(1 — p)) — floga is minimized 
at a m — (1 _ \t n _ d) ■ After some algebra we get the same expression of LRp" 1 as (TT2)) . 
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where e is noise with expectation E[e] = and covariance Cov(e) = I n , org > 0, 
(3 S = (f3 ,(3j 1 ,...,f3j lsl _ 1 ) T , and X$ is the nx \S\ design matrix obtained from X by 
removing the (j + l)st column for all j <^S. 

Model consistency of LoRP for variable selection. The ordinary least squares 
(OLS) fitted vector under model S is y s = M s y with M s = X s (X T s X s )- l X~ r s being 
a projection matrix. From Theorem [11] the best subset chosen by LoRP is 

S n = argminLR^ = argmax{KL(^||l - p s )}, p s = \yf/ • 

The term p$ is a measure of fit. It will be very close to if model S is big, otherwise, 
it will be close to 1 if S is too small. Therefore, it is reasonable to consider only 
cases in which p$ is bounded away from and 1. In order to prove the theoretical 
properties of LoRP, we need the following technical assumption. 

(A) For each candidate model S, ps is bounded away from and 1, i.e., there are 
constants c\ and c<i such that <c\ <ps < C2 < 1 with probability 1 (w.p.l). 

Let erg— \\y — y s \\ 2 /n and «S nu u = {0}. It is easy to see that for every S 

i-P5 = \\y s \\ 2 /\\y\\ 2 i na 2 s = p s \\y\\ 2 , ny 2 = \\y Sn J\ 2 < \\y s \\ 2 < \\v\\ 2 (14) 

where y denotes the arithmetic mean Yl^iVil 71 - Assumption (A) follows from 
(A') 0<liminf(y) 2 <limsup(i||j/|| 2 ) <oo and MS : <r| ->cr| > w.p.l. 

The first condition of (A') is obviously very mild and satisfied in almost all cases in 
practice. The second one is routinely used to derive asymptotic properties of model 
selection criteria (e.g., Theorem 2 of [Sha97] and Condition 1 of |WLT07] ) . 

Lemma 12 (LoRP for variable selection) The loss rank of model S is 

LR 5 = LR£» = f \og{na%) + f H(&) + | log (15) 

where ps and <tJ are defined in f|T4l) . and H{p) := — plogp — (1— p)log(l— p) is the 
entropy of p. Under Assumption (A) or (A'), after neglecting constants independent 
of S, the loss rank of model S has the form 

LR S = |loga| + Mi ogr2 + Op(l), (16) 

where Op(l) denotes a bounded random variable w.p.l. 

Proof. Inserting y T y = na^/ps into (1121) and rearranging terms gives f|T5|) . 
By Assumption (A) the last term in f|T5|) is bounded w.p.l. Taylor expansion 
log(l — p) = — p + 0(p 2 ) implies H(p)/p + \ogp — > 1, hence = ^logr?, + 0(l). 

Finally, dropping the iS-independent term |logn from (|T5l) gives ( ITBl . ■ 

This lemma implies that the loss rank LR^ here is a BIC-type criterion, thus we 
immediately can state without proof the following theorem which is the well-known 
model consistency of BIC-type criteria (interested readers can find the routine proof 
in, for example, [Cha06j). 
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Theorem 13 (Model consistency) Under Assumption (A) or (A'), LoRP is 
model consistent for variable selection in the sense that the probability of selecting 
the true model goes to 1 for data size n— >oo. 



The optimal regression estimation of LoRP. The second goal of model selection 
is often measured by the (asymptotic) mean efficiency |Shi 83] which is briefly defined 
as follows. Let St denote the true model (which may contain an infinite number of 
covariates). For a candidate model S, let L n (S) = \\X St (3 St — X s f3 s \\ 2 be the squared 
loss where j3 s is the OLS estimate, and R n (S) = E[L n («S)] be the risk. The mean 
efficiency of a selection criterion 8 is defined by the ratio 

mf s R n (S) 

eS(s) = WJM £ 1 

where S$ is the model selected by 5. 5 is said to be asymptotically mean efficient if 
lim inf n ^. 00 eff(5) = l. 

By minimizing the loss rank in a we have shown in the previous paragraph that 
LoRP satisfies the first goal of model selection. We now show that with a suitable 
choice of a, LoRP also satisfies the second goal. 

From ( II l|) . we have 

LR a s (y\x) = f log(aI + 2y T y) + § logn - f log(a) - ^ log(l + a). 

By choosing a = a = exp(— jj^zjifz^j)^ under Assumption (A), the loss rank of model 
S (neglecting the common constant |logn) is proportional to 

LR«( 2/ |x)=nlog ( T 2 + ^^ + op(l), 

which is the corrected AIC of [HT89J. As a result, LoRP (a) is optimal in terms of 
regression estimation, i.e., it is asymptotically mean efficient (|Shi83]. 1983; [Sh a97j . 
1997). 

Theorem 14 (Asymptotic mean efficiency) Under Assumption (A) or (A'), 
with a suitable choice of a, the loss rank is proportional to the corrected AIC. As a 
result, LoRP is asymptotically mean efficient. 



5 Experiments 

In this section we present a simulation study for LoRP, compare it to other meth- 
ods and also demonstrate how LoRP can be used for some specific problems like 
choosing tuning parameters for kNN and spline regression. All experiments are 
conducted by using MATLAB software and the source code is freely available at 
http:/ /www. hutterl.net/ai/lorpcode. zip. 
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Comparison to AIC and BIC for model identification. Samples are generated 
from the model 

y = A, + PiXi + ... + P d X d + e, e ~ N(0, a 2 ) (17) 

where (3 is the vector of coefficients with some zero entries. Without loss of gen- 
erality, we assume that po = 0, otherwise, we can center the response vector y and 
standardize the design matrix X to exclude /?o from the model. We shall compare 
the performance of LoRP to that of BIC and AIC with various factors n, d and 
sienal-to-noise ratio (SNR) which is \\f3\\ 2 /a 2 (||/3|| 2 is often called the length of the 
signal) . 

For a given set of factors (n, d, SNR), the way we simulate a dataset from model 
(117p is as follows. Entries of X are sampled from a uniform distribution on [—1,1]. To 
generate (3, we first create a vector u= (ui,...,Ud) T whose entries are sampled from 
a uniform distribution on [—1,1]. The number of true covariates d* is randomly 
selected from {l,2,...,d}, the last d—d* entries of u are set to zero, then coefficient 
vector j3 is computed by fa — {length of signal} |. In our simulation, the 
length of signal was fixed to be 10. n observation errors ei,...,e„ are sampled from a 
normal distribution with mean and variance a 2 = \ \(3\ | 2 /SNR. Finally, the response 
vector is computed by y = X{3+e. For each set of factors (n, d, SNR), 1000 datasets 
are simulated in the same manner to assess the average performance of the methods. 
For simplicity, a candidate model is specified by its order, i.e., we search the best 
model among only d models {l},{l,2}...,{l,2,...,d}. For the general case, an efficient 
branch-and-bound algorithm [Mil02, Clip. 3] can be used to exhaustively search for 
the best subsets. 

Table CE] presents percentages of correctly-fitted models with various factors n, 
d and SNR. As shown, LoRP outperforms the others. The better performance of 
LoRP over BIC, which is the most popular criterion for model identification, is very 
encouraging. This is probably because LoRP is a selection criterion with a data- 
dependent penalty. This improvement needs a theoretical justification which we 
intend to do in the future. 



Table 1: Percentage of correctly- fitted models over 1000 replications 



n d 


SNR 


AIC 


BIC 


LoRP 


n d 


SNR 


AIC 


BIC 


LoRP 


100 5 


1 


62 


62 


69 


300 5 


1 


74 


82 


83 




5 


85 


85 


86 




5 


78 


90 


91 




10 


80 


90 


91 




10 


81 


94 


94 


10 


1 


52 


42 


54 


10 


1 


63 


67 


71 




5 


63 


77 


77 




5 


70 


85 


86 




10 


68 


84 


85 




10 


74 


90 


90 


20 


1 


32 


22 


36 


20 


1 


54 


45 


61 




5 


55 


63 


65 




5 


64 


79 


80 




10 


56 


73 


74 




10 


67 


85 


85 
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Comparison to AIC and BIC for regression estimation. Consider the fol- 
lowing model which is from [Shi83] 



y = y{x) = log ^ + e, e ~ N{0, a 2 ), x E [0, 1). (18) 

We approximate the true function by a Fourier series and consider the problem of 
choosing a good order among models 

y = f3 o + J2 C ^0 m ^ + ^ k = l,...,K. 
i=i 

In the present context, a model in Section @] is completely specified by the order K of 
the Fourier series. Samples are created from ( Tl8|) at the points Xi = 5^^, i = l,...,n. 
As in |Shi83j . we take S = .99, and A = 163 with various n and a. The performance 
is measured by the estimate of mean efficiency over 1000 replications. 

Table [2] represents the simulation results. In general, LoRP (with a = a as in 
Section H]) outperforms the others, except for cases with unrealistically high noise 
level. For cases with high noise, mean efficiency of BIC is often larger than that 
of AIC and LoRP. This was also shown in the simulation study of [Shi83], Table 1. 
This phenomenon can be explained as follows. 

The risk of model k (the model specified by its order k) is R n (k) = ||(J — 
2/trucl| 2 + ^ cr2 where M k is the regression matrix under model k and y tvuc is 
the vector of true values y(xi). When o — >oo, the ideal k* = arginffci? n (A;) — > 1. Be- 
cause BIC penalizes the model complexity more strongly than AIC and LoRP do, 
the order chosen by BIC is closer to k* = 1 than the ones chosen by AIC and LoRP. 
As a result, mean efficiency of BIC is larger than that of the others. 



Table 2: Estimates of mean efficiency over 1000 replications 



n a 


AIC 


BIC 


LoRP 


n 


a 


AIC 


BIC 


LoRP 


400 .001 


1.00 


.98 


.99 


600 


.001 


1.00 


.98 


1.00 


.01 


.93 


.68 


.90 




.01 


.99 


.67 


.92 


.05 


.88 


.67 


.95 




.05 


.90 


.66 


.94 


.1 


.88 


.67 


.92 




.1 


.90 


.67 


.93 


.5 


.81 


.66 


.85 




.5 


.82 


.66 


.83 


1 


.79 


.63 


.82 




1 


.79 


.65 


.82 


5 


.67 


.65 


.70 




5 


.65 


.67 


.66 


10 


.54 


.67 


.59 




10 


.54 


.59 


.54 


100 


.31 


.89 


.33 




100 


.40 


.90 


.41 



LoRP for selecting a good number of neighbors in kNN. Let us now see how 

LoRP can be applied to select a good parameter k in kNN regression. 
We created a dataset of n = 100 observations (xi,yi) from the model: 

y = f{x) + e, with /(*) = sm(1 ^+ a2)) , x G [0, 1] (19) 
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Figure 1: Choosing the tuning parameters in kNN and spline regression. The curves 
have been scaled by their standard deviations. 

where e~N(0,a 2 ) with a = 0.5. The regression matrix for kNN regression is 
determined by — | if j GA4(xj) and else. Then, the loss rank is 

LR(fc) = wZ{z]o g (y T SWy) - f logdetSf }, 

where Sa^ = (I — M^) T (I — M^) + aI. The most widely-used method to select a 
good k is probably Generalized Cross- Validation (GCV) [CW79) : GCV(A;)=n||(I- 
M^)y\\ 2 /[tr(I-M^)} 2 . To judge how well GCV and LoRP work, we compare 
them to the expected prediction error defined as 

n n 
i=l »=1 3&M k {xi) 

Figure U[a) shows the curves LR(Jfe), GCV(Jfe), EPE(/c) for fc = 2,...,20 (the trivial 
case = 1 is omitted) , in which k = 7- nearest neighbors is chosen by LoRP and k = 8 
is chosen by GCV. The "ideal" k is 5. Both LoRP and GCV do a reasonable job. 
LoRP works slightly better than GCV. 

LoRP for selecting a good smoothing parameter. We now further demon- 
strate the use of LoRP in selecting a good smoothing parameter for spline regression. 
Consider the following problem: find a function belonging to the class of functions 
with continuous 2nd derivative that minimizes the following penalized residual sum 
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of squares: 

n „ 

RSS(/) = J> - f^)f + X / U"(?)?dt, 
i=i J 

where A is called the smoothing parameter. The second term penalizes the curvature 
of function / and the smoothing parameter A controls the amount of penalty. Our 
goal is to choose a good A. 

It is well known (see, e.g., [HTFOlj . Section 5.4) that the solution is a natu- 
ral spline f{x) = Y^j = iNj{x)6j where Ni(x),...,N n (x) are the basis functions of the 
natural cubic spline: 

N^x) = 1, N 2 (x) = x, N k+2 (x) = d k (x) - d n ^(x) with d k (x) = { ^^^± . 

The problem thus reduces to finding a vector 6 M n that minimizes 

RSS(0) = {y- NO) T (y - N0) + X6 T Q6 

where iVV,- = Nj(xi) and f^j = fN"(x)N"(x)dx. It is easy to see that the solution 
is A = (N T N + Xny 1 N T y, and the fitted vector is y = N0 X = M x y with M A = 
N(N T N+\Q)~ 1 N T y. The fitted vector is linear in y, thus the loss rank is 

LR(A) = argmin{| log(y T S%y) - \ logdet S a x } 

where S% = (I -M X ) T (I -M x ) + al . 

Let us consider again the dataset generated from model ffl9|) . Figure [T^b) shows 
the curves LR(A), GCV(A) and EPE(A). The derivation of expressions for GCV(A) 
and EPE(A) is similar to the previous example. A ~ 3 x 10~ 4 is the optimal value 
selected by the "ideal" criterion EPE. A~5xl0 -4 and A~7xl0~ 4 are selected by 
LoRP and GCV, respectively. One again, like the previous example, LoRP selects 
a better A than GCV does. 



6 Comparison to Gaussian Bayesian Linear Re- 
gression 

We now consider LBFR from a Bayesian perspective with Gaussian noise and prior, 
and compare it to LoRP. In addition to the noise model as in PML, one also has 
to specify a prior. Bayesian model selection (BMS) proceeds by selecting the model 
that has largest evidence. In the special case of LBFR with Gaussian noise and prior 
and a type II maximum likelihood estimate for the noise variance, the expression 
for the evidence has a similar structure as the expression of the loss rank. 

Gaussian Bayesian LBFR / MAP. Recall from SecJH ExJH] that F d is the class 
of functions f w (x) =w T (f)(x) (wEM d ) that are linear in feature vector </>. Let 

n ( i \ exp(-|(z-/x) T a- 1 (2-M)) 
GauHB^cr) := (2f) ^ (20) 
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denote a general iV-dimensional Gaussian distribution with mean fi and covariance 
matrix a. We assume that observations y are perturbed from f w (x) by independent 
additive Gaussian noise with variance and zero mean, i.e., the likelihood of y 
under model w is P(y \w) = Gauss n (y|$u;,/3 _1 /), where $j a = 4> a (xi). A Bayesian 
assumes a prior (before seeing y) distribution on w. We assume a centered Gaussian 
with covariance matrix (aC) -1 , i.e., P(w) = GausSd(n>|0,a~ 1 C~ 1 ). From the prior 
and the likelihood one can compute the evidence and the posterior 

Evidence: P{y) = J P(y\w)P(w)dw = Gauss„(?/|0, /T 1 ^ 1 ) (21) 

Posterior: P(w\y) = P (y\w)P (w) / P (y) = G&usSd(w\w, A^ 1 ) 

B : = $ T $, A := aC + /3B, M := P^A' 1 ^, S := I - M, (22) 

i& := /3A _1 $ T y, y := &w = My 

A standard Bayesian point estimate for w for fixed d is the one that maximizes 
the posterior (MAP) (which in the Gaussian case coincides with the mean) w = 
&rgmax w P (w\y) = (3A~ l $ T y. For a — > 0, MAP reduces to Maximum Likelihood 
(ML), which in the Gaussian case coincides with the least squares regression of 
ExJHJ For a>0, the regression matrix M is not a projection anymore. 

Bayesian model selection. Consider now a family of models {J-'i,^,---}- Here 
the Td are the linear regressors with d basis functions, but in general they could 
be completely different model classes. All quantities in the previous paragraph 
implicitly depend on the choice of J 7 , which we now explicate with an index. In 
particular, the evidence for model class J 7 is Pjr(y). BMS chooses the model class 
(here d) J 7 of highest evidence: 

jrBMS = argmaxP^j/) 

Once the model class J rBMS is determined, the MAP (or other) regression function 
fw T BMs or Mj-bms are chosen. The data variance may be known or estimated 
from the data, C is often chosen /, and a has to be chosen somehow. Note that 
while a — > leads to a reasonable MAP=ML regressor for fixed d, this limit cannot 
be used for BMS. 

Comparison to LoRP. Inserting ( 120]) into ( 121]) and taking the logarithm we see 
that BMS minimizes 

- log P F {y) = ly T Sy - \ log det S - % log £ (23) 

w.r.t. J 7 . Let us estimate /3 by ML: We assume a broad prior a<^/3 so that /?|| = 
0{f) can be neglected. Then - 9lo ^(y) = ly^Sy- ^ + 0(fn) = 0^/3^/3: = 
n/(y T Sy). Inserting (3 into ( 123]) we get 

-logPHl/) = f log V T Sy -| log det S -f log ^ (24) 
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Taking an improper prior P(/3) oc/3" 1 and integrating out /3 leads for small a to a 
similar result. The last term in (124"]) is a constant independent of J 7 and can be 
ignored. The first two terms have the same structure as in linear LoRP ( flUj) . but 
the matrix S is different. In both cases, a act as regularizers, so we may minimize 
over a in BMS like in LoRP. For a = (which neither makes sense in BMS nor in 
LoRP), M in BMS coincides with M of ExJHJ but still the Sq in LoRP is the square 
of the S in BMS. For a>0, M of BMS may be regarded as a regularized regressor 
as suggested in Secj2] (a), rather than a regularized loss function (b) used in LoRP. 
Note also that BMS is limited to (semi)parametric regression, i.e., does not cover 
the non-parametric kNN Exj2]and kernel ExJHl unlike LoRP. 

Since B only depends on x (and not on y), and all P are implicitly conditioned 
on x, one could choose C = B. In this case, M = 7$i?~ 1( l )T , with 7 = < 1 
for a > 0, is a simple multiplicative regularization of projection $5 _1 $ T , and ( I24p 
coincides with ( II ip for suitable a, apart from an irrelevant additive constant, hence 
minimizing (|24p over a also leads to (Tl2l . 

7 Comparison to other Model Selection Schemes 

In this section we give a brief introduction to PML for (semi)parametric regression, 
and its major instantiations, AIC, BIC, and MDL principle, whose penalty terms are 
all proportional to the number of parameters d. The effective number of parameters 
is often much smaller than d, e.g., if there are soft constraints like in ridge regression. 
We compare MacKay's trace formula |Mac92] for Gaussian Bayesian LBFR and 
Hastie's et al. trace formula |HTF01] for general linear regression with LoRP. 

Penalized ML (AIC, BIC, MDL). Consider a <i-dimensional stochastic model 
class like the Gaussian Bayesian linear regression example of Section[6j Let Pd{y\w) 
be the data likelihood under d- dimensional model w£lR d . The maximum likelihood 
(ML) estimator for fixed d is 

w = argmaxPrf(y|tt;) = argmin{— logP d(y\w)} (25) 

w w 

Since — logP d(y\w) decreases with d, we cannot find the model dimension by sim- 
ply minimizing over d (overfitting). Penalized ML adds a complexity term to get 
reasonable results 

d = argmin{— log Pd{y \ w) + Penalty (d)} (26) 

d 

The penalty introduces a tradeoff between the first and second term with a minimum 
at d<oo. Various penalties have been suggested: AIC |Aka73] uses d, BIC |Sch78] 
and the (crude) MDL |Ris78j lGru04] use |logra for Penalty (d). There are at least 
three important conceptual differences to LoRP: 

• In order to apply PML one needs to specify not only a class of regression 
functions, but a full probabilistic model P d (y\w), 



19 



• PML ignores or at least does not tell how to incorporate a potentially given 
loss- function, 

• PML is mostly limited to selecting between (semi)parametric models. 

We discuss two approaches to the last item in the remainder of this section (where 
AIC, BIC, and MDL are not directly applicable): (a) for non-parametric models like 
kNN or kernel regression, or (b) if d does not reflect the "true" complexity of the 
model. [Mac92j suggests an expression for the effective number of parameters d e ff as 
a substitute for d in case of (b), while |HTF01] introduce another expression which 
is applicable for both (a) and (b). 

The trace penalty for parametric Gaussian LBFR. We continue with the 
Gaussian Bayesian linear regression example (see Section [6] for details and notation). 
Performing the integration in ( 12 ip . |Mac92| Eq.(21)] derives the following expression 
for the Bayesian evidence for C—I 

-\ogP(y) = {aE w + /^) + (§logdet A -flog a) -flog£ (27) 
E D = ±\\®w-y\\l, E w = ±\\w\\ 2 2 

(the first bracket in (127]) equals ^y T Sy and the second equals — ^logdetS 1 , cf. (123]) ). 
Minimizing ( 12 7 jl w.r.t. a leads to the following relation: 

= = Ew + ltTA-^-£ (£\ogdetA = trA-i) 

He argues that corresponds to the effective number of parameters, hence 

rfMcK . = a ||^||2 = 2a E w = d-atiA- 1 (28) 

The trace penalty for general linear models. We now return to general linear 
regression y = M(x)y (jTJ). LBFR is a special case of a projection matrix M = M 2 
with rank d = trM being the number of basis functions. M leaves d directions 
untouched and projects all other n — d directions to zero. For general M, [HTF01, 
Sec. 5. 4.1] argue to regard a direction that is only somewhat shrunken, say by a 
factor of < (3 < 1, as a fractional parameter ((3 degrees of freedom). If fix ,...,/?„ 
are the shrinkages = eigenvalues of M, the effective number of parameters could be 
defined as [HTF01L Sec. 7. 6] 

n 

d T ■= X> = trM > 
i=i 

where HTF stands for Hastie-Tibshirani-Friedman, which generalizes the relation 
d = trM beyond projections. For MacKay's M (l2"2"]l . trM = d-atrA~ 1 , i.e., rf"J F is 
consistent with and generalizes d^ K . 

Problems. Though nicely motivated, the trace formula is not without problems. 
First, since for projections, M = M 2 , one could have argued equally well for d^ F = 
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trM 2 . Second, for kNN we have trM=| (since M is | on the diagonal), which does 
not look unreasonable. Consider now kNN', which is defined as follows: we average 
over the k nearest neighbors excluding the closest neighbor. For sufficiently smooth 
functions, kNN' for suitable k is still a reasonable regressor, but trM = (since M is 
zero on the diagonal). So c^j.J F = for kNN', which makes no sense and would lead 
one to always select the k = l model. 

Relation to LoRP. In the case of kNN', trM 2 would be a better estimate for 
the effective dimension. In linear LoRP, — logdetS'o, serves as complexity penalty. 
Ignoring the nullspace of S = (I— M) T (J— M) jSJ), we can Taylor expand — |logdetSo 
in M 

oo 

-ilogdetSo = -trlog(J-M) = ±tr(M s ) = trM + §trM 2 + ... 

s=l 

For BMS d21J) with S = I — M ([22]) we get half of this value. So the trace penalty 
may be regarded as a leading order approximation to LoRP. The higher order terms 
prevent peculiarities like in kNN'. 

Coding/MDL interpretation of LoRP. The basic idea of MDL is as follows 
[Grii04] : "The goal of statistical inferences may be cast as trying to find regularity 
in the data. 'Regularity' may be identified with 'ability to compress'. MDL combines 
these two insights by viewing learning as data compression: it tells us that, for a 
given set of hypotheses H and data set D, we should try to find the hypothesis or 
combination of hypotheses in % that compress D most." 

The standard incarnation of (crude) MDL is as follows: If if is a stochastic 
model of (discrete) data D, we can code D (by Shannon-Fano) in \— log 2 P(-D|i?)] 
bits. If we have a class of models H, we also have to code H (somehow in, say, 
L(H) bits) in order to be able to decode D. MDL chooses the hypothesis H MDL = 
argmin^ g ^{— log 2 P {D\H) + L(H)} of minimal two-part code. For instance, if H, is 
the class of all polynomials of all degrees with each coefficient coded to |log 2 n bits 
(i.e., 0(n -1 / 2 ) accuracy) and we condition on x, i.e., D^y\x, MDL takes the form 
(1251) and flU, i.e., H MDL = (w,a). 

We now give LoRP (for discrete D) a data compression/MDL interpreta- 
tion. For simplicity, we will first assume that all loss values are different, i.e., if 
Loss r (y'\x) 7^ Loss r (y"\x) for y' ^y" (adding infinitesimal random noise to Loss r 
easily ensures this). In this case, Rank r (-|cc) -.y n IV is an order preserving bijec- 
tion, i.e., Rank r (^'|a?) < Rank r (y"\x) iff Loss r (y'\x) < Loss r (y"|a;) with no gaps in 
the range of Rank r (-|a?). 

Phrased differently, Rank r (-|cE) codes each y' &y n as a natural number m in 
increasing loss-order. The natural number m can itself be coded in [log 2 m] bits 
(using plain not prefix coding). Let us call this code of y' the Loss Rank Code 
(LRC). LRC has a nice characterization: LRC is the shortest loss-order preserving 
code. Ignoring the rounding, the Length of LRC r (t/|:r) is LR r (^'|a3): 
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Proposition 15 (Minimality property) If all loss values are different, i.e., if 

Loss r (y'\x) ^ Loss r (y"\x) for all y =/ y" 

then the loss rank (code) ofy is the smallest/ shortest among all loss-order preserving 
rankings/ codes C in the sense that 

Rank(y) = mm{C(y) : C G y n ^-lN A (*) } 
[LR(y)/log2j = min{Length(C(y)) : CGf ^{0,1}* A (★) } 
(*) := [Loss^O < Loss(y") & C(y') < C(y"), Vy', y"\ 

The proof follows from the fact that if a discrete injection (code) is order pre- 
serving, there exists a "smallest" one without gaps in the range. So LoRP mini- 
mizes the Loss Rank Code, where LRC itself is the shortest among all loss-order 
preserving codes. From this perspective, LoRP is just a different (non-stochastic, 
non-parametric, loss-based) incarnation of MDL. The MDL philosophy provides a 
justification of LoRP (j2J), its regularization (jSJ), and loss function selection (Section 
|SJ). This identification should also allow to apply or adapt the various consistency 
results of MDL, implying that LoRP is consistent under some mild conditions. 

If some losses are equal, Rank r (-|£c) :y n ^]N still preserves the order <, but the 
mapping is neither surjective nor injective anymore. 

Large regression classes 1Z. The classes TZ of regressors we considered so far 
were discrete and "small", often indexed by an integer complexity index (like k in 
kNN or d in LBFR). But large classes are also thinkable. 

As an extreme case, consider the class of all regressors. Clearly, there is an t=td 
which "knows" D and perfectly fits D (r(xi\D) — y%, Vz), but is the worst possible 
on all other D' {r(x i \D') = oo 1 Vz, VD'^D). This r has (discrete) Rank 1, so is best 
according to LoRP. So if TZ is too large, LoRP can overfit too. 

Consider a more realistic example by not taking all of the first d basis functions in 
LBFR, but selecting some basis functions 0^,...,^, i.e., TZ is indexed by d integers, 
and d may be variable too. 

One solution approach is to group more regressors in TZ into one function class 
J 7 , e.g., the class of functions J r k,d={uJi(pi 1 +...Wd(pi d -wE]R d , l<i\< ...<id<k} that 
are linear in d of the first k bases. Now 1Z is a small class indexed by d and k only. 

Looking at the coding interpretation of LR r and the MDL philosophy, suggests 
to assign a code to rGiR in order to get a complete code for D: 

r best = argmin{LR r (y|a;) + L(r)} 

r 

where r is the length of a code for r (given 1Z). For 7Z ~ IV a single integer 
has to be coded, e.g., k in L(r) — L(k) ~ log 2 /c bits, which can usually be safely 
dropped/ignored. For more complex classes like the (ungrouped) LBFR subset se- 
lection above, L(r) = L(i 1 ,...,i d ,d)~dlog 2 k-\-log 2 d can become important. 
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8 Loss Functions and their Selection 

General additive loss. Linear LoRP y = M(x)y of Section [3] can easily be gener- 
alized to non-quadratic loss. Let us consider the p>0 loss 

hoss M {y\x) := 
V(L) = 
Let < := 

where ^!:=r(- + l), be the volume of the unit ^-dimensional p-norm "ball". Since 
V(L) is a linear transformation of this ball with transformation matrix (I — M) _1 
and scaling L, we have \V(L)\ = v^L n /det(I — M), hence 

LR M (y\x) = log|y(Lossj|f(i/|x))| = nlog||(/-M)y|| p -logdet(/-M) +log< 

(29) 

For the p = 2 norm, (|29|) reduces to LR° M Q). Note that Lossa/ : = g(\\y — y \\ p ) leads 
to the same result (|29|) for any monotone increasing g, i.e., only the order of the loss 
matters, not its absolute value. More generally Lossm = g(J2ih(yi — Vi)) f° r am/ ^ 
implies 

LR M (y\x) = nlogv^Y.i h(yi - fa)) - logdet(J-M), where 
v h n (l) := \{zeR n :E,K^<l}\ 1/n 

is a one-dimensional function of / (independent D and M), once to be determined 
(e.g., v^(l) =l-(v?) I n ocl for p-norm loss). Regularization may be performed by 
M^^fM with optimization over 7< 1. 

Loss-function selection. In principle, the loss function should be part of the 
problem specification, since it characterizes the ultimate goal. For instance, whether 
a test should more likely classify a healthy person as sick than a sick person as 
healthy, depends on the severity of a misclassification (loss) in each direction. In 
reality, though, having to specify the loss function can be a nuisance. Sure, the loss 
has to respect some general features, e.g., that it increases with the deviation of fa 
from yi. Otherwise it is chosen by convenience or rules of thumb, rather than by 
elicitation of the real goal, for instance preferring the Euclidean norm over p ^ 2 
norms. If we subscribe to the procedure of choosing the loss function, we could ask 
whether this may be done in a more principled way. Consider a (not too large) 
class of loss functions Loss a , indexed by some parameter a. For instance, Loss" = 
\\y — y\\ a from the previous paragraph. The regularized loss §5§ also constitutes a 
class of losses. In this case we minimized over the regularization parameter a. This 
suggests to choose in general the loss function that has minimal loss rank LR". The 
justifications are similar to the ones for minimizing LR" w.r.t. r. Note that the term 
logt>£ cannot be dropped anymore, unlike in ([111]) . 
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(EL^-m^ = Wv-Mp = W-M)y\\ P 

{y' G R n : \\(I-M)y'\\ p < L} = {{I-M)~ x z G R n : ||z|| p < L} 
\{zeR n :\\z\\ p <l}\ = 2"nr=i 1 ^/ 1 ?!> 



9 Self-Consistent Regression 



So far we have considered only "on-data" regression. LoRP only depends on the 
regressor r on data D and not on x ^ {xi,...,x n }. We now construct canonical re- 
gressors for off-data x from regressors given only on-data. First, this may ease the 
specification of the regression functions, second, it is a canonical way for interpola- 
tion (LoRP can't distinguish between r that are identical on D), and third, we show 
that many standard regressors (kNN, Kernel, LBFR) are self-consistent in the sense 
that they are canonical. We limit our exposition to linear regression. 

Off-data regression. A linear regressor is completely determined by the n func- 
tions rrij ([6]), but not by the matrix function M ([7]). Indeed, two sets {rrij} and {m'j} 
that coincide on D = (x,y), i.e. rrij(xi\x) =m'-{xi\x) Wi ,j but possibly differ for x<^x, 
lead to the same matrix Mij(x) =rrij(xi\x) = m'j(xi\x). LoRP has the advantage of 
only depending on M, but this also means that it cannot distinguish between an rrij 
that behaves well on x^Lx and one that, e.g., wildly oscillates outside x. 

Typically, the rrij are given and, provided the model complexity is chosen ap- 
propriately e.g. by LoRP, they properly interpolate x. Nevertheless, a canonical 
extension from M to rrij would be nice. In this way LoRP would not be vulnerable 
to bad rrij, and we could interpolate D (predict y for any xEX) even without rrij 
given a-priori. 

We define a self-consistent regression scheme based only on M (for all n). We 
ask for an estimate y of y for x^x. We add a virtual data point {xq^q) to D, where 
xo = x. If we knew yo = y we could estimate yo = r(xo\{(xo,yo)}UD), but we don't 
know yo. But we could require a self-consistency condition, namely that yo = yo for 

Xq^lX. 

Definition 16 (canonical and self-consistent regressors) Let M'^x^oKijKn 
be the regression matrix for the data set D' = {(x ,?/o)}U-D = ((xo,x) ,(yo,y)) = (x' ,y') 
of size n+1. 

(i) A linear regressor yo = ?{xq\D) is called a canonical regressor for M' if the 
consistency condition y = r(xo\D') =Y^=o^ojV3 holds \/xo,D. 

(ii) A regressor r is called self- consistent iff = r, i.e. if 
r(x \{(x ,r(x \D))}UD)=r(x \D) Vx ,D. 

(Hi) A class of regressors TZ={r} is called self- consistent if R = {r} CTZ. 

We denote the solution of the self-consistency condition y = Y^^o^ojVj ^ ^o- 
So we have to solve 

y = 2^M 0jyj + M 00 y y = 3 J = ' 3 

j=i oo Z^j=i m oj 

where the last equality only holds if J2j=o^oj = 1> which is often the case, in par- 
ticular for kNN and Kernel regression, but not necessarily for LBFR. 
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Proposition 17 (canonical regressor) The linear regressor 

" M'(x') 

Vo = r(x \D) := },m j (x \x)y j , where mj(x \x) := 

■ =1 1 M ool !E J 

is the unique canonical regressor for M' (if M' m <1). 

Example 18 (self-consistent kNN, |ExJ25 M' Qj {x') = J for G N' k {xo) and 
else. The nearest neighbors A/^(xo) of Xo among a;' consist of Xq and the fc — 1 
nearest neighbors A4-i(^o) = : <^ of xo among x, i.e. N' k {xo) = {xo}UA4-i(^o)- Hence 

2^j=i lvl oj Z^jeJ k j( zj j=1 

Canonical kNN is equivalent to standard (k-l)NN, so the class of canonical kNN 
regressors coincides with the standard kNN class. ^> 

Example 19 (self-consistent kernel) 

, , / \ K{xQ, Xj ) _ YTj=l K {XQ,Xj) yj 

Canonical kernel regression coincides with the standard kernel smoother. ^> 
Example 20 (self-consistent LBFR) 



B' = ^2<l>(x i )(f>(x i ) T = B + <j>(x )(j>(x ) 
M' Qi = cPixofB'- 1 ^) = cf>(x 



! _ B-^ixoWxoYB-^ 
1 + cf>(x o yB-i(f>(xo) 



4>{xj 



M MM, = J^_ I 

K >3 1 i A/T 1 i u 00 



1 + Mqo 1 + Mqo uu l + M o 



In the first line we used the Sherman-Morrison formula for inverting B' . In the 
second line we defined M j = </)(xo) T 5~ 1 0(xj), extending M. 



JJo 



oo j=1 j=1 



Canonical LBFR coincides with standard LBFR. (} 

Proposition 21 (self-consistent regressors) Kernel regression and linear basis 
function regression are self- consistent. kNN is not self- consistent but the class of 
kNN regressors TZ = {rkNN'-k^JN} is self- consistent. 

To summarize, we expect LoRP to select good regressors with proper interpola- 
tion behavior for canonical and self-consistent regressors. 
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10 Nearest Neighbors Classification 



We now consider k-nearest neighbors classification in more detail. In order to get 
more insight into LoRP we seek a case that allows analytic solution. In general, the 
determinant detS^ cannot be computed analytically, but for x lying on a hypercube 
of the regular grid X = ZZ d we can. We derive exact expressions, and consider the 
limits n— >oo, k — >oo, and d— >oo. 

kNN on one-dimensional grid. We consider the d—1 dimensional case first. We 
assume x — (1,2,3, ...,n), a circular metric d(xi,Xj) = d(i,j) = min{|i— j\,n— \i— j\}, 
and odd k < n. The kNN regression matrix 

Mij = bi-j with = i if d(i,j) < ^ and otherwise 

is a diagonal-constant (Toeplitz) matrix with circularity property b^j —bi-j +n . For 
instance, for A; = 3 and n = 5 

(110 1' 
1110 
1110 
111 
10 11, 

For every circulant matrix, the eigenvectors v 1 ,...,v n are waves v l - = 9 31 with 9 = 
e 27rv/ ~ T//n . The eigenvalues are the fourier transform bi = ^J- =1 bj9 ~i l of b, since 
^2jM i jV l j = J2jbi-j9 jl = J2jbj^ l ~^ 1 = v iJ2jbj@~ jl = hv\, where we exploited circularity 
of b and 9^ 1 . For MkNN in particular we get 



lA rJi _ 1 9 lk ' 2 - 9- lk ' 2 sm(%lk/n) 



k k 9 l / 2 -9- 1 / 2 ksm(irl/n) 

t 3=-^ t t 

circularity geometric sum insert 9 



< 1 for I j£ n 



and b n = l. The only 1-vector v n = l corresponds to a constant shift Ui^Ui+c under 
which kNN (like many other regressors) is invariant. Instead of regularizing LoRP 
with a>0 we can restrict V(L) C M 1 to the space orthogonal to v n , which means 
dropping b n = l in the determinant. Intuitively, since this invariant direction is the 
same for all k, we can drop the same additive infinite constant from LR for every 
k, which is irrelevant for comparisons (formally we should compute lim^^olLR^ — 
LRp! 2 }). The exact expression for the restricted log-determinant (denoted by a 
prime) is 

n-l 

-§logdet'S = -logdet'(l-M) = -^tog(l-S,) =: \c\ k = c l nk tvM 

i=i 

For large n (and large k) the expression can be simplified. The exact, large n, and 
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large k<^n expressions are 



c\ = --Vlog ( 1 _ M*tyn) \ 

Cook = — / log 1 1 - 7— TT ) 

7T J_ w/2 \ ksm(z)J 



k f n/2 ^ sin(kz) ^ j_ (z = nl/n for I < § 

z = irl/n — n else 



c 1 

oooo 



1 C 00 / sint\ 

- / log 1 dt = 3.202 (t = kz, sin(z) ~ z) 

TT J-oo v * / 



Further, 0^3 = 3^3=3.295. Since is decreasing in k, equals 3.2 within 3% 
for all k. 

kNN on d-dimensional grid. We now consider x = X d = {l,...,rii} d on a d- 
dimensional complete hypercube grid with n — nf points and Manhattan distance 
d(xi,xj) =d(i,j) = Y^a=id j i(.ia,ja) for all Xi = iEX d and xj = j G X d , where d\ is the 
one-dimensional circular distance defined above (so actually X d is a discrete torus). 
For k = kf, the neighborhood Mk{x) of x is a cube of side-length k x . In this case, 
M — M\®...®M\ is a d-fold tensor product of the Id kiNN matrices Mi of sample 
size ni. The eigenvectors of M are v h ®...®v ld with eigenvalues bi 1 -...-bi d . We get 

ni-l n d ~l 



-logdet'(l-M) = ^ log(l - V ... • (30) 



ft oooo 

„ 1 1/(1 ' 
a=l 

For instance, for d—2, numerical integration gives 0^^=2.2 compared to 3.2 in one 
dimension. For higher dimensions, evaluation of the d-dimensional integral becomes 
cumbersome, and we resort to a different approximation. 

Taylor series in M. We can also (not only for kNN) expand logdetS'o in a Taylor 
series in M: 

oo 

-logdet'(l-M) = -tr'log(l-M) = J^±tr'(M s ) 

8 = 1 

OO OO 

= ^r>'Mo rf = == f4 

s=l s=l 

where we used ti(A®B) =ti(A) -ti(B) and (A®B)" = A s (g)B s and defined 

A nikls := ^tr'(Mf) = ± f° f^Vdt 

The one-dimensional integral can be expressed as a finite sum with s terms or 
evaluated numerically. For any n and k one can show that A n k\ = A n k2 = 1 > A n k s for 
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s > 2. So the expansion above is useful for large d. Note also that c^ fc is monotone 
decreasing in d. For ti-^oo we have 

oo 

c = EK^r = 1+1+0+... = § 

8=1 

i.e. c d nk decreases monotone in d from about 3.2 to |. 

The practical implication of this observation, though, is limited, since k = kf-^-oc 
is actually not fixed for d — > oo. Indeed, in practical high-dimensional problems, 
k<^n^3 d , but in our grid example k = kf>3 d . Real data do not form full grids but 
sparse neighborhoods if d is large. 

11 Conclusion and Outlook 

We introduced a new method, the Loss Rank Principle, for model selection. The loss 
rank of a model is defined as the number of other data that fit the model better than 
the training data. The model chosen by LoRP is the one of smallest loss rank. The 
loss rank has an explicit expression in case of linear models. Model consistency and 
asymptotic efficiency of LoRP were considered. The numerical experiments suggest 
that LoRP works well in practice. A comparison between LoRP and other methods 
for model selection was also presented. 

In this paper, we have only scratched at the surface of LoRP. LoRP seems to be 
a promising principle with a lot of potential, leading to a rich field. In the following 
we briefly summarize miscellaneous considerations. 

Comparison to Rademacher complexities. For a (binary) classification prob- 
lem, the rank (JTJ of classifier r can be re-formulated as the probability that a ran- 
domly relabeled sample y' behaves better than the actual y. The more flexible r 
is, the larger its rank is. The Rademacher complexity [KolOR BBL02J of r is the 
expectation of the difference between the misclassifying loss under the actual y and 
the misclassifying loss under a randomly relabeled sample y'. The more flexible r is, 
the larger its Rademacher complexity is. Therefore, there is a close connection be- 
tween LoRP and Rademacher complexities. Model selection based on Rademacher 
complexities has a number of attractive properties and has been attracting many 
researchers, thus it's worth discovering this connection. Some results have been re- 
cently already obtained, however, to keep the present paper not so long, we decide 
to present the results in another paper. 

Monte Carlo estimates for non-linear LoRP. For non-linear regression we 
did not present an efficient algorithm for the loss rank/volume LR, r (y\x). The 
high-dimensional volume |V^.(L)| Q may be computed by Monte Carlo algorithms. 
Normally V r (L) constitutes a small part of y n , and uniform sampling over y n is not 
feasible. Instead one should consider two competing regressors r and r' and compute 
| VnV'j/l V| and | VnV'|/|V'| by uniformly sampling from V and V respectively e.g., 
with a Metropolis-type algorithm. Taking the ratio we get |V'|/|V| and hence the 
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loss rank difference LR r — LR r /, which is sufficient for LoRP. The usual tricks and 
problems with sampling apply here too. 

LoRP for hybrid model classes. LoRP is not restricted to model classes indexed 
by a single integral "complexity" parameter, but may be applied more generally to 
selecting among some (typically discrete) class of models/regressors. For instance, 
the class could contain kNN and polynomial regressors, and LoRP selects the com- 
plexity and type of regressor (non-parametric kNN versus parametric polynomials). 

Generative versus discriminative LoRP. We have concentrated on counting y's 
given fixed x, which corresponds to discriminative learning. LoRP might equally well 
be used for counting (x,y), as alluded to in the introduction. This would correspond 
to generative learning. Both regimes are used in practice. See |LJ08] for some recent 
results on their relative benefit, and further references. 
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Appendix: List of Abbreviations and Notations 

AIC= Akaike Information Criterion. 

BIC= Bayesian Information Criterion. 

BMS= Bayesian Model Selection 

kNN= k Nearest Neighbors. 

LBFR= Linear Basis Function Regression. 

LoRP= Loss Rank Principle. 

LRC = Loss Rank Code. 

MAP= Maximum a Posterior. 

MDL= Minimum Description Length. 

ML= Maximum Likelihood. 

PML= Penalized Maximum Likelihood. 

D = {(xi,y 1 ),...,(x n ,y n )}= observed data. 

V = {D}= set of all possible data D. 

X x 3^=observation space. 

x = (xi,...,x n )= vector of x-observations, similarly y. 

f:X—>y= functional dependence between x and y. 

J- = ("small") class of functions /. 

%— class of stochastic hypotheses/models. 

r:T>— >J r = regressor /mo del. 

yi = r(xi\D)= r-estimate of yi. 

TZ— ("small") class of regressors/models. 

w&M d = parametrization of J-j. 

Mk{x)= set of indices of the k nearest neighbors of x in D. 
L = Loss r (D) = Loss(y,y)= empirical loss of r on D. 
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Rank r (L) = <Ey n : Loss r (y'\x) < L}= loss rank of r. 

V(L)= volume of D under r. 

LK r (y\x) = log rank/volume of D. 

LR"= regularized LR r . 

d e ff = effective dimension. 

rrij(x,x)= coefficients of linear regressor. 

M(x)= linear regression matrix or "hat" matrix. 

log= natural logarithm. 

a^b: a is replaced by b. 
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